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Abstract 

Bayesian optimization is an effective methodol¬ 
ogy for the global optimization of functions with 
expensive evaluations. It relies on querying a dis¬ 
tribution over functions defined by a relatively 
cheap surrogate model. An accurate model for 
this distribution over functions is critical to the 
effectiveness of the approach, and is typically fit 
using Gaussian processes (GPs). However, since 
GPs scale cubically with the number of obser¬ 
vations, it has been challenging to handle objec¬ 
tives whose optimization requires many evalua¬ 
tions, and as such, massively parallelizing the op¬ 
timization. 

In this work, we explore the use of neural net¬ 
works as an alternative to GPs to model dis¬ 
tributions over functions. We show that per¬ 
forming adaptive basis function regression with a 
neural network as the parametric form performs 
competitively with state-of-the-art GP-based ap¬ 
proaches, but scales linearly with the number 
of data rather than cubically. This allows us to 
achieve a previously intractable degree of paral¬ 
lelism, which we apply to large scale hyperpa¬ 
rameter optimization, rapidly finding competitive 
models on benchmark object recognition tasks 
using convolutional networks, and image caption 
generation using neural language models. 
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1. Introduction 

Recently, the field of machine learning has seen unprece¬ 
dented growth due to a new wealth of data, increases in 
computational power, new algorithms, and a plethora of 
exciting new applications. As researchers tackle more am¬ 
bitious problems, the models they use are also becoming 
more sophisticated. However, the growing complexity of 
machine learning models inevitably comes with the intro¬ 
duction of additional hyperparameters. These range from 
design decisions such as the shape of a neural network 
architecture, to optimization parameters such as learning 
rates, to regularization hyperparameters such as weight de¬ 
cay. Proper setting of these hyperparameters is critical for 
performance on difficult problems. 

There are many methods for optimizing over hyperparame¬ 
ter settings, ranging from simplistic procedures like grid or 
random search (Bergstra & Bengio, 2012), to more sophis¬ 
ticated model-based approaches using random forests (Hut- 
ter et al., 2011) or Gaussian processes (Snoek et al., 2012). 
Bayesian optimization is a natural framework for model- 
based global optimization of noisy, expensive black-box 
functions. It offers a principled approach to modeling un¬ 
certainty, which allows exploration and exploitation to be 
naturally balanced during the search. Perhaps the most 
commonly used model for Bayesian optimization is the 
Gaussian process (GP) due to its simplicity and flexibility 
in terms of conditioning and inference. 

However, a major drawback of GP-based Bayesian opti¬ 
mization is that inference time grows cubically in the num¬ 
ber of observations, as it necessitates the inversion of a 
dense covariance matrix. For problems with a very small 
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number of hyperparameters, this has not been an issue, as 
the minimum is often discovered before the cubic scaling 
renders further evaluations prohibitive. As the complex¬ 
ity of machine learning models grows, however, the size 
of the search space grows as well, along with the number 
of hyperparameter configurations that need to be evaluated 
before a solution of sufficient quality is found. Fortunately, 
as models have grown in complexity, computation has be¬ 
come significantly more accessible and it is now possible 
to train many models in parallel. A natural solution to 
the hyperparameter search problem is to therefore combine 
large-scale parallelism with a scalable Bayesian optimiza¬ 
tion method. The cubic scaling of the GP, however, has 
made it infeasible to pursue this approach. 

The goal of this work is to develop a method for scaling 
Bayesian optimization, while still maintaining its desirable 
flexibility and characterization of uncertainty. To that end, 
we propose the use of neural networks to learn an adaptive 
set of basis functions for Bayesian linear regression. We re¬ 
fer to this approach as Deep Networks for Global Optimiza¬ 
tion (DNGO). Unlike a standard Gaussian process, DNGO 
scales linearly with the number of function evaluations— 
which, in the case of hyperparameter optimization, corre¬ 
sponds to the number of models trained—and is amenable 
to stochastic gradient training. Although it may seem that 
we are merely moving the problem of setting the hyper¬ 
parameters of the model being tuned to setting them for 
the tuner itself, we show that for a suitable set of design 
choices it is possible to create a robust, scalable, and effec¬ 
tive Bayesian optimization system that generalizes across 
many global optimization problems. 

We demonstrate the effectiveness of DNGO on a number 
of difficult problems, including benchmark problems for 
Bayesian optimization, convolutional neural networks for 
object recognition, and multi-modal neural language mod¬ 
els for image caption generation. We find hyperparameter 
settings that achieve competitive with state-of-the-art re¬ 
sults of 6.37% and 27.4% on ClFAR-10 and ClFAR-100 
respectively, and BLEU scores of 25.1 and 26.7 on the Mi¬ 
crosoft COCO 2014 dataset using a single model and a 3- 
model ensemble. 

2. Background and Related Work 

2.1. Bayesian Optimization 

Bayesian optimization is a well-established strategy for the 
global optimization of noisy, expensive black-box func¬ 
tions (Mockus et al., 1978). For an in-depth review, see 
Lizotte (2008), Brochu et al. (2010) and Osborne et al. 
(2009). Bayesian optimization relies on the construction 
of a probabilistic model that defines a distribution over 
objective functions from the input space to the objective 


of interest. Conditioned on a prior over the functional 
form and a set of N observations of input-target pairs 
V = {{xn,yn)}n=i’ '^^e relatively cheap posterior over 
functions is then queried to reason about where to seek the 
optimum of the expensive function of interest. The promise 
of a new experiment is quantified using an acquisition func¬ 
tion, which, applied to the posterior mean and variance, 
expresses a trade-off between exploration and exploitation. 
Bayesian optimization proceeds by performing a proxy op¬ 
timization over this acquisition function in order to deter¬ 
mine the next input to evaluate. 

Recent innovation has resulted in significant progress in 
Bayesian optimization, including elegant theoretical re¬ 
sults (Srinivas et al., 2010; Bull, 2011; de Freitas et al., 
2012), multitask and transfer optimization (Krause & Ong, 
2011; Swersky et al., 2013; Bardenet et al., 2013) and 
the application to diverse tasks such as sensor set selec¬ 
tion (Garnett et al., 2010), the tuning of adaptive Monte 
Carlo (Mahendran et al., 2012) and robotic gait control (Ca- 
landra et al., 2014b). 

Typically, GPs have been used to construct the distribu¬ 
tion over functions used in Bayesian optimization, due to 
their flexibility, well-calibrated uncertainty, and analytic 
properties (Jones, 2001; Osborne et al., 2009). Recent 
work has sought to improve the performance of the GP ap¬ 
proach through accommodating higher dimensional prob¬ 
lems (Wang et al., 2013; Djolonga et al., 2013), input non- 
stationarities (Snoek et al., 2014) and initialization through 
meta-learning (Feurer et al., 2015). Random forests, which 
scale linearly with the data, have also been used success¬ 
fully for algorithm configuration by Flutter et al. (2011) 
with empirical estimates of model uncertainty. 

More specifically, Bayesian optimization seeks to solve the 
minimization problem 

X* = argmin/(x), (1) 

where we take A' to be a compact subset of In 

our work, we build upon the standard GP-based approach 
of Jones (2001) which uses a GP surrogate and the ex¬ 
pected improvement acquisition function (Mockus et al., 
1978). For the surrogate model hyperparameters 0, 
let (T^(x; 0) = E(x, x; 0) be the marginal predictive vari¬ 
ance of the probabilistic model, /r(x; T), 0) be the predic¬ 
tive mean, and define 

, , /(Xbest) - F(x;T>,0) 

=- TWITS) -■ 

where /(xbest) is the lowest observed value. The expected 
improvement criterion is defined as 

aEi(x;X>, 0)= (3) 

(t(x;X>, 0) [7(x)T>(7(x)) -f A/'(7 (x); 0,1)] . 
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Here $(•) is the cumulative distribution function of a stan¬ 
dard normal, and JV{-', 0,1) is the density of a standard nor¬ 
mal. Note that numerous alternate acquisition functions 
and combinations thereof have been proposed (Kushner, 
1964; Srinivas et ah, 2010; Hoffman et al., 2011), which 
could be used without affecting the analytic properties of 
our approach. 

2.2. Bayesian Neural Networks 

The application of Bayesian methods to neural networks 
has a rich history in machine learning (MacKay, 1992; Hin¬ 
ton & van Camp, 1993; Buntine & Weigend, 1991; Neal, 
1995; De Freitas, 2003). The goal of Bayesian neural net¬ 
works is to uncover the full posterior distribution over the 
network weights in order to capture uncertainty, to act as 
a regularizer, and to provide a framework for model com¬ 
parison. The full posterior is, however, intractable for most 
forms of neural networks, necessitating expensive approx¬ 
imate inference or Markov chain Monte Carlo simulation. 
More recently, full or approximate Bayesian inference has 
been considered for small pieces of the overall architecture. 
For example, in similar spirit to this work, Lazaro-Gredilla 
& Figueiras-Vidal (2010); Hinton & Salakhutdinov (2008) 
and Calandra et al. (2014a) considered inference over just 
the last layer of a neural network. Alternatively, variational 
approaches are developed in Kingma & Welling (2014); 
Rezende et al. (2014) and Mnih & Gregor (2014), where 
a neural network is used in a variational approximation to 
the posterior distribution over the latent variables of a di¬ 
rected generative neural network. 

3. Adaptive Basis Regression with Deep 
Neural Networks 

A key limitation of GP-based Bayesian optimization is that 
the computational cost of the technique scales cubically in 
the number of observations, limiting the applicability of the 
approach to objectives that require a relatively small num¬ 
ber of observations to optimize. In this work, we aim to 
replace the GP traditionally used in Bayesian optimization 
with a model that scales in a less dramatic fashion, but re¬ 
tains most of the GP’s desirable properties such as flexi¬ 
bility and well-calibrated uncertainty. Bayesian neural net¬ 
works are a natural consideration, not least because of the 
theoretical relationship between Gaussian processes and 
infinite Bayesian neural networks (Neal, 1995; Williams, 
1996). However, deploying these at a large scale is very 
computationally expensive. 

As such, we take a pragmatic approach and add a Bayesian 
linear regressor to the last hidden layer of a deep neural 
network, marginalizing only the output weights of the net 
while using a point estimate for the remaining parameters. 
This results in adaptive basis regression, a well-established 


statistical technique which scales linearly in the number 
of observations, and cubically in the basis function dimen¬ 
sionality. This allows us to explicitly trade off evaluation 
time and model capacity. As such, we form the basis using 
the very flexible and powerful non-linear functions defined 
by the neural network. 

First of all, without loss of generality and assuming com¬ 
pact support for each input dimension, we scale the in¬ 
put space to the unit hypercube. We denote by (/>(•) = 
..., 0 d(-)]^ the vector of outputs from the last 
hidden layer of the network, trained on inputs and tar¬ 
gets V := {(pi-niyn)}n=i ^ ^ We take these to 

be our set of basis functions. In addition, define $ to 
be the design matrix arising from the data and this basis, 
where ^nd = (f>d(xn) is the output design matrix, and y 
the stacked target vector. 

These basis functions are parameterized via the weights 
and biases of the deep neural network, and these param¬ 
eters are trained via backpropagation and stochastic gradi¬ 
ent descent with momentum. In this training phase, a linear 
output layer is also fit. This procedure can be viewed as a 
maximum a posteriori (MAP) estimate of all parameters 
in the network. Once this “basis function neural network” 
has been trained, we replace the MAP-parameterized out¬ 
put layer with a Bayesian linear regressor that captures un¬ 
certainty in the weights. See Section 3.1.2 for a more elab¬ 
orate explanation of this choice. 

The predictive mean /r(x; 0) and variance cr^ (x; 0) of the 
model are then given by (see Bishop, 2006) 

p(x;X>,0) = m'^0(x)-f ? 7 (x) , (4) 

CT^(x;X>,0) = (/)(x)'^K~V(x) + ^ (5) 

where 

m = /3K"^$'^y e (6) 

K = (7) 

Here, ? 7 (x) is a prior mean function which is described in 
Section 3.1.3, and y = y — ? 7 (x). In addition, a, f3 G & 
are regression model hyperparameters. We integrate out a 
and /? using slice sampling (Neal, 2000) according to the 
methodology of Snoek et al. (2012) over the marginal like¬ 
lihood, which is given by 

D N N 

logp(y I X, a, /3) = — log a -f y log ^ - y log(27r) 

- ^l|y-^m|P-^log|K| . (8) 

It is clear that the computational bottleneck of this proce¬ 
dure is the inversion of K. However, note that the size of 
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Figure 1. The time per suggested experiment for our method com¬ 
pared to the state-of-the-art GP based approach from Snoek et al. 
(2014) on the six dimensional Hartmann function. We ran each 
algorithm on the same 32 core system with 80GB of RAM five 
times and plot the mean and standard deviation. 

this matrix grows with the output dimensionality D, rather 
than the number of observations N as in the GP case. This 
allows us to scale to significantly more observations than 
with the GP as demonstrated in Figure 1 . 

3.1. Model details 

3.1.1. Network architecture 

A natural concern with the use of deep networks is that 
they often require significant effort to tune and tailor to spe¬ 
cific problems. One can consider adjusting the architecture 
and tuning the hyperparameters of the neural network as 
itself a difficult hyperparameter optimization problem. An 
additional challenge is that we aim to create an approach 
that generalizes across optimization problems. We found 
that design decisions such as the type of activation function 
used significantly altered the performance of the Bayesian 
optimization routine. For example, in Figure 2 we see that 
the commonly used rectified linear (ReLU) function can 
lead to very poor estimates of uncertainty, which causes 
the Bayesian optimization routine to explore excessively. 
Since the bounded tanh function results in smooth func¬ 
tions with realistic variance, we use this nonlinearity in this 
work; however, if the smoothness assumption needs to be 
relaxed, a combination of rectified linear functions with a 
tanh function only on the last layer can also be used in order 
to bound the basis. 

In order to tune any remaining hyperparameters, such as the 
width of the hidden layers and the amount of regulariza¬ 
tion, we used GP-based Bayesian optimization. For each 
of one to four layers we ran Bayesian optimization using 


the Spearmint (Snoek et al., 2014) package to minimize the 
average relative loss on a series of benchmark global op¬ 
timization problems. We tuned a global learning rate, mo¬ 
mentum, layer sizes, ^2 normalization penalties for each set 
of weights and dropout rates (Hinton et al., 2012) for each 
layer. Interestingly, the optimal configuration featured no 
dropout and very modest ^2 normalization. We suspect that 
dropout, despite having an approximate correction term, 
causes noise in the predicted mean resulting in a loss of pre¬ 
cision. The optimizer instead preferred to restrict capacity 
via a small number of hidden units. Namely, the optimal 
architecture is a deep and narrow network with 3 hidden 
layers and approximately 50 hidden units per layer. We use 
the same architecture throughout our empirical evaluation, 
and this architecture is illustrated in Figure 2(d). 

3.1.2. Marginal likelihood vs MAP estimate 

The standard empirical Bayesian approach to adaptive ba¬ 
sis regression is to maximize the marginal likelihood with 
respect to the parameters of the basis (see Equation 8), thus 
taking the model uncertainty into account. However, in 
the context of our method, this requires evaluating the gra¬ 
dient of the marginal likelihood, which requires inverting 
a D X D matrix on each update of stochastic gradient de¬ 
scent. As this makes the optimization of the net signifi¬ 
cantly slower, we take a pragmatic approach and optimize 
the basis using a point estimate and apply the Bayesian 
linear regression layer post-hoc. We found that both ap¬ 
proaches gave qualitatively and empirically similar results, 
and as such we in practice employ the more efficient one. 

3.1.3. Quadratic Prior 

One of the advantages of Bayesian optimization is that it 
provides natural avenues for incorporating prior informa¬ 
tion about the objective function and search space. For ex¬ 
ample, when choosing the boundaries of the search space, 
a typical assumption has been that the optimal solution lies 
somewhere in the interior of the input space. However, by 
the curse of dimensionality, most of the volume of the space 
lies very close to its boundaries. Therefore, we select a 
mean function 77(x) (see Equation 4) to reflect our subjec¬ 
tive prior beliefs that the function is coarsely approximated 
by a convex quadratic function centered in the bounded 
search region, i.e., 

? 7 (x) = A + (x - c)^A(x - c) (9) 

where c is the center of the quadratic, A is an offset and A 
a diagonal scaling matrix. We place a Gaussian prior with 
mean 0.5 (the center of the unit hypercube) on c, horse¬ 
shoe (Carvalho et al., 2009) priors on the diagonal elements 
Afcfe Vfc G ,K} and integrate out b, A and c using 

slice sampling over the marginal likelihood. 
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Figure 2. A comparison of the predictive mean and uncertainty learned by the model when using 2(a) only tanh, 2(c) only rectified linear 
(ReLU) activation functions or 2(b) ReLU’s but a tanh on the last hidden layer. The shaded regions correspond to standard deviation 
envelopes around the mean. The choice of activation function significantly modifies the basis functions learned by the model. Although 
the ReLU, which is the standard for deep neural networks, is highly flexible, we found that its unbounded activation can lead to extremely 
large uncertainty estimates. Subfigure 2(d) illustrates the overall architecture of the DNGO model. Dashed lines correspond to weights 
that are marginalized. 


The horseshoe is a so-called one-group prior for inducing 
sparsity and is a somewhat unusual choice for the weights 
of a regression model. Here we choose it because it 1) has 
support only on the positive reals, leading to convex func¬ 
tions, and 2) it has a large spike at zero with a heavy tail, re¬ 
sulting in strong shrinkage for small values while preserv¬ 
ing large ones. This last effect is important for handling 
model misspecification as it allows the quadratic effect to 
disappear and become a simple offset if necessary. 

3.2. Incorporating input space constraints 

Many problems of interest have complex, possibly un¬ 
known bounds, or exhibit undefined behavior in some re¬ 
gions of the input space. These regions can be character¬ 
ized as constraints on the search space. Recent work (Gel¬ 
bart et al., 2014; Snoek, 2013; Gramacy & Lee, 2010) has 
developed approaches for modeling unknown constraints in 
GP-based Bayesian optimization by learning a constraint 
classifier and then discounting expected improvement by 
the probability of constraint violation. 

More specifically, define c„ S {0,1} to be a binary in¬ 
dicator of the validity of input x„. Also, denote the 
sets of valid and invalid inputs as V = {(x„, t/n) | c„ = 1} 
and I = {(x„,?/„) I c„ = 0}, respectively. Note that 
V --v U I. Lastly, let be the collection of constraint 
hyperparameters. The modified expected improvement 
function can be written as 

acEi(x;T>, 0,’4') = aEi(x; V, 0)P [c = 11 x, T>,’S'] . 


In this work, to model the constraint surface, we similarly 
replace the Gaussian process with the adaptive basis model. 


integrating out the output layer weights; 

P[c = 11 X, X>,’S'] = 

r (10) 

/ P [c = 1 j X, P, w, ’S'] P(w; ’S')dw . 

J W 

In this case, we use a Laplace approximation to the pos¬ 
terior. For noisy constraints we perform Bayesian lo¬ 
gistic regression, using a logistic likelihood function for 
P [c = 1 j X, I?, w,’S']. For noiseless constraints, we re¬ 
place the logistic function with a step function. 

3.3. Parallel Bayesian Optimization 

Obtaining a closed form expression for the joint acquisi¬ 
tion function across multiple inputs is intractable in gen¬ 
eral (Ginsbourger & Riche, 2010). However, a successful 
Monte Carlo strategy for parallelizing Bayesian optimiza¬ 
tion was developed in Snoek et al. (2012). The idea is to 
marginalize over the possible outcomes of currently run¬ 
ning experiments when making a decision about a new ex¬ 
periment to run. Following this strategy, we use the pos¬ 
terior predictive distribution given by Equations 4 and 5 to 
generate a set of fantasy outcomes for each running experi¬ 
ment which we then use to augment the existing dataset. By 
averaging over sets of fantasies, we can perform approxi¬ 
mate marginalization when computing El for a candidate 
point. We note that this same idea works with the constraint 
network, where instead of computing marginalized El, we 
would compute the marginalized probability of violating a 
constraint. 

To that end, given currently running jobs with in¬ 
puts {xjl^^j^, the marginalized acquisition function 
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Experiment 

# Evals 

SMAC 

TPE 

Spearmint 

DNGO 

Branin (0.398) 

200 

0.655 ±0.27 

0.526 ±0.13 

0.398 ± 0.00 

0.398 ±0.00 

Hartmannb (-3.322) 

200 

-2.977 ±0.11 

-2.823 ±0.18 

-3.3166 ± 0.02 

-3.319 ±0.00 

Logistic Regression 

100 

8.6 ±0.9 

8.2 ±0.6 

6.88 ±0.0 

6.89 ±0.04 

EDA (On grid) 

50 

1269.6 ±2.9 

1271.5 ±3.5 

1266.2 ±0.1 

1266.2 ±0.0 

SVM (On grid) 

100 

24.1 ± 0.1 

24.2 ±0.0 

24.1 ± 0.1 

24.1 ± 0.1 


Table 1. Evaluation of DNGO on global optimization benchmark problems versus scalable (TPE, SMAC) and non-scalable (Spearmint) 
Bayesian optimization methods. All problems are minimization problems. For each problem, each method was run 10 times to produce 
error bars. 


aMCEi (■; 25, 0, is given by 

aMCEi(x; I?, ©,'!') = 

J acEi(x;X>U 

X P [{Cj,%■}/=! dj/i...dj/„dci...dc„ . 

When this strategy is applied to a GP, the cost of comput¬ 
ing El for a candidate point becomes cubic in the size of the 
augmented dataset. This restricts both the number of run¬ 
ning experiments that can be tolerated, as well as the num¬ 
ber of fantasy sets used for marginalization. With DNGO 
it is possible to scale both of these up to accommodate a 
much higher degree of parallelism. 

Finally, following the approach of Snoek et al. (2012) we 
integrate out the hyperparameters of the model to obtain 
our final integrated acquisition function. For each iteration 
of the optimization routine we pick the next input, x*, to 
evaluate according to 

X* = argmaxoMCEi(x;D, , (11) 

X 

where 

aMCEi(x;D, {xj}2^;^) = 

f , (12) 

/ aMCEi(x;X>, {xj}/^i,0,'i') ded^f. 

4. Experiments 

4.1. HPOLib Benchmarks 

In the literature, there exist several other methods for 
model-based optimization. Among these, the most popular 
variants in machine learning are the random forest-based 
SMAC procedure (Hutter et al., 2011) and the tree Parzen 
estimator (TPE) (Bergstra et al., 201 1). These are faster to 
fit than a Gaussian process and scale more gracefully with 
large datasets, but this comes at the cost of a more heuristic 
treatment of uncertainty. By contrast, DNGO provides a 
balance between scalability and the Bayesian marginaliza¬ 
tion of model parameters and hyperparameters. 


Method 

Test BLEU 

Human Expert LBL 

24.3 

Regularized LSTM 

24.3 

Soft-Attention LSTM 

24.3 

10 LSTM ensemble 

24.4 

Hard-Attention LSTM 

25.0 

Single LBL 

25.1 

2 LBL ensemble 

25.9 

3 LBL ensemble 

26.7 


Table 2. Image caption generation results using BLEU-4 on the 
Microsoft COCO 2014 test set. Regularized and ensembled 
LSTM results are reported in Zaremba et al. (2015). The baseline 
LBL tuned by a human expert and the Soft and Hard Attention 
models are reported in Xu et al. (2015). We see that ensembling 
our top models resulting from the optimization further improves 
results significantly. We noticed that there were distinct multiple 
local optima in the hyperparameter space, which may explain the 
dramatic improvement from ensembling a small subset of models. 


To demonstrate the effectiveness of our approach, we com¬ 
pare DNGO to these scalable model-based optimization 
variants, as well as the input-warped Gaussian process 
method of Snoek et al. (2014) on the benchmark set of con¬ 
tinuous problems from the HPOFib package (Eggensperger 
et al., 2013). As Table 1 shows, DNGO significantly out¬ 
performs SMAC and TPE, and is competitive with the 
Gaussian process approach. This shows that, despite vast 
improvements in scalability, DNGO retains the statistical 
efficiency of the Gaussian process method in terms of the 
number of evaluations required to find the minimum. 

4.2. Image Caption Generation 

In this experiment, we explore the effectiveness of DNGO 
on a practical and expensive problem where highly paral¬ 
lel evaluation is necessary to make progress in a reason¬ 
able amount of time. We consider the task of image cap¬ 
tion generation using multi-modal neural language models. 
Specifically, we optimize the hyperparameters of the log- 
bilinear model (LBL) from Kiros et al. (2014) to maximize 
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(a) “A person riding a wave in the ocean.” 


(b) “A bird sitting on top of a field.” (c) “A horse is riding 

a horse.” 


Figure 3. Sample test images and generated captions from the best LBL model on the COCO 2014 dataset. The first two captions 
sensibly describe the contents of their respective images, while the third is offensively inaccurate. 


the BLEU score of a validation set from the recently re¬ 
leased COCO dataset (Lin et al., 2014). In our experiments, 
each evaluation of this model took an average of 26.6 hours. 

We optimize learning parameters such as learning rate, 
momentum and batch size; regularization parameters like 
dropout and weight decay for word and image representa¬ 
tions; and architectural parameters such as the context size, 
whether to use the additive or multiplicative version, the 
size of the word embeddings and the multi-modal represen¬ 
tation size '. The final parameter is the number of factors, 
which is only relevant for the multiplicative model. This 
adds an interesting challenge, since it is only relevant for 
half of the hyperparameter space. This gives a total of 11 
hyperparameters. Even though this number seems small, 
this problem offers a number of challenges which render 
its optimization quite difficult. Eor example, in order to 
not lose any generality, we choose broad box constraints 
for the hyperparameters; this, however, renders most of the 
volume of the model space infeasible. In addition, quite 
a few of the hyperparameters are categorical, which intro¬ 
duces severe non-stationarities in the objective surface. 

Nevertheless, one of the advantages of a scalable method is 
the ability to highly parallelize hyperparameter optimiza¬ 
tion. In this way, high quality settings can be found after 
only a few sequential steps. To test DNGO in this scenario, 
we optimize the log-bilinear model with up to 800 parallel 
evaluations. 

Running between 300 and 800 experiments in parallel (de¬ 
termined by cluster availability), we proposed and evalu¬ 
ated approximately 2500 experiments—the equivalent of 
over 2700 CPU days—in less than one week. Using the 
BLEU-4 metric, we optimized the validation set perfor¬ 
mance and the best LBL model found by DNGO out¬ 
performs recently proposed models using LSTM recurrent 
neural networks (Zaremba et al., 2015; Xu et al., 2015) on 

'Details are provided in the supplementary material. 


Layer type 

# Filters 

Window 

Stride 

Convolution 

96 

3x3 


Convolution 

96 

3x3 


Max pooling 


3x3 

2 

Convolution 

192 

3x3 


Convolution 

192 

3x3 


Convolution 

192 

3x3 


Max pooling 


3x3 

2 

Convolution 

192 

3x3 


Convolution 

192 

1x1 


Convolution 

lo/Ioo 

1x1 


Global averaging 


6x6 


Softmax 


Table 3. Our convolutional neural network architecture. This 
choice was chosen to be maximally generic. Each convolution 
layer is followed by a ReLU nonlinearity. 


the test set. This is remarkable, as the LBL is a relatively 
simple approach. Ensembling this top model with the sec¬ 
ond and third best (under the validation metric) LBL mod¬ 
els resulted in a test-set BLEU score ^ of 26.7, significantly 
outperforming the LSTM-based approaches. We noticed 
that there were distinct multiple local optima in the hyper¬ 
parameter space, which may explain the dramatic improve¬ 
ment from ensembling a small number of models. We show 
qualitative examples of generated captions on test images 
in Eigure 3. Eurther figures showing the BLEU score as a 
function of the iteration of Bayesian optimization are pro¬ 
vided in the supplementary. 

^We have verified that our BLEU score evaluation is consistent 
across reported results. We used a beam search decoding for our 
test predictions with the LBL model. 
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Method 

CIFAR-10 

CIFAR-100 

Maxout 

9.38% 

38.57% 

DropConnect 

9.32% 

N/A 

Network in network 

8.81% 

35.68% 

Deeply supervised 

7.97% 

34.57% 

ALL-CNN 

7.25% 

33.71% 

Tuned CNN 

6.37% 

27.4% 


Table 4. We use our algorithm to optimize validation set error as 
a function of various hyperparameters of a convolutional neural 
network. We report the test errors of the models with the optimal 
hyperparameter configurations, as compared to current state-of- 
the-art results. 


4.3. Deep Convolutional Neural Networks 

Finally, we use DNGO on a pair of highly competitive 
deep learning visual object recognition benchmark prob¬ 
lems. We tune the hyperparameters of a deep convolutional 
neural network on the CIFAR-10 and CIFAR-100 datasets. 
Our approach is to establish a single, generic architecture, 
and specialize it to various tasks via individualized hyper¬ 
parameter tuning. As such, for both datasets, we employed 
the same generic architecture inspired by the configuration 
proposed in Springenberg et al. (2014), which was shown 
to attain strong classification results. This architecture is 
detailed in Table 5. 

For this architecture, we tuned the momentum, learning 
rate, £2 weight decay coefficients, dropout rates, standard 
deviations of the random i.i.d. Gaussian weight initializa¬ 
tions, and corruption bounds for various data augmenta¬ 
tions: global perturbations of hue, saturation and value, 
random scalings, input pixel dropout and random horizon¬ 
tal reflections. We optimized these over a validation set of 
10,000 examples drawn from the training set, running each 
network for 200 epochs. See Figure 4 for a visualization of 
the hyperparameter tuning procedure. 

We performed the optimization on a cluster of Intel® Xeon 

TM 

Phi coprocessors, with 40 jobs running in parallel using 
a kernel library that has been highly optimized for efficient 
computation on the Intel® Xeon Phi coprocessor^. For 
the optimal hyperparameter configuration found, we ran a 
final experiment for 350 epochs on the entire training set, 
and report its result. 

Our optimal models for CIFAR-10 and CIFAR-100 
achieved test errors of 6.37% and 27.4% respectively. A 
comparison to published state-of-the-art results (Goodfel- 
low et al., 2013; Wan et al., 2013; Lin et al., 2013; Lee et al., 
2014; Springenberg et al., 2014) is given in Table 4. We see 

^Available at https://github.com/orippel/ 
micmat 



Figure 4. Validation errors on CIFAR-100 corresponding to dif¬ 
ferent hyperparameter configurations as evaluated over time. 
These are represented as a planar histogram, where the shade of 
each bin indicates the total count within it. The current best vali¬ 
dation error discovered is traced in black. This projection demon¬ 
strates the exploration-versus-exploitation paradigm of Bayesian 
optimization, in which the algorithm trades off visiting unex¬ 
plored parts of the space, and focusing on parts which show 
promise. 


that the parallelized automated hyperparameter tuning pro¬ 
cedure obtains models that are highly competitive with the 
state-of-the-art in just a few sequential steps. 

A comprehensive overview of the setup, the architecture, 
the tuning and the optimum configuration can be found in 
the supplementary material. 

5. Conclusion 

In this paper, we introduced deep networks for global op¬ 
timization, or DNGO, which enables efficient optimization 
of noisy, expensive black-box functions. While this model 
maintains desirable properties of the GP such as tractabil- 
ity and principled management of uncertainty, it greatly 
improves its scalability from cubic to linear as a function 
of the number of observations. We demonstrate that while 
this model allows efficient computation, its performance is 
nevertheless competitive with existing state-of-the-art ap¬ 
proaches for Bayesian optimization. We demonstrate em¬ 
pirically that it is especially well suited to massively paral¬ 
lel hyperparameter optimization. 

While adaptive basis regression with neural networks pro¬ 
vides one approach to the enhancement of scalability, other 
models may also present promise. One promising line 
of work, for example by Nickson et al. (2014), is to in¬ 
troduce a similar methodology by instead employing the 
sparse Gaussian process as the underlying probabilistic 
model (Snelson & Ghahramani, 2005; Titsias, 2009; Hens- 
man et al., 2013). 
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Supplementary Material 

A. Convolutional neural network experiment 
specifications 

In this section we elaborate on the details of the network archi¬ 
tecture, training and the meta-optimization. In the following sub¬ 
sections we elaborate on the hyperparametrization scheme. The 
priors on the hyperparameters as well as their optimal configura¬ 
tions for the two datasets can be found in Table 6. 

A.l. Architecture 

The model architecture is specified in Table 5. 

A.2. Data augmentation 

We corrupt each input in a number of ways. Below we describe 
our parametrization of these corruptions. 
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Layer type 

# Filters 

Window 

Stride 

Convolution 

96 

3x3 


Convolution 

96 

3x3 


Max pooling 


3x3 

2 

Convolution 

192 

3x3 


Convolution 

192 

3x3 


Convolution 

192 

3x3 


Max pooling 


3x3 

2 

Convolution 

192 

3x3 


Convolution 

192 

1x1 


Convolution 

10/100 

1x1 


Global averaging 


6x6 


Softmax 


Table 5. Our convolutional neural network architecture. This 
choice was chosen to be maximally generic. Each convolution 
layer is followed by a ReLU nonlinearity. 

HSV We shift the hue, saturation and value fields of each input 
by global constants bn ~ U{—Bh,Bh), bs ~ U{—Bs,Bs), 
bv ^ U {—By, By)- Similarly, we globally stretch the saturation 
and value fields by global constants as ~ U (; 1 + ^s). 

~ + Ay). 

Scalings Each input is scaled by some factor 

Translations We crop each input to size 27 x 27, where the 
window is chosen randomly and uniformly. 

Horizontal reflections Each input is reflected horizontally 
with a probability of 0.5. 

Pixel dropout Each input element is dropped independently 
and identically with some random probability Dq. 

A.3. Initialization and training procedure 

We initialize the weights of each convolution layer m with i.i.d 
zero-mean Gaussians with standard deviation where Fm is 
the number of parameters per filter for that layer. We chose this 
parametrization to produce activations whose variances are invari¬ 
ant to filter dimensionality. We use the same standard deviation 
for all layers but the input, for which we dedicate its own hyper¬ 
parameter a I as it oftentimes varies in scale from deeper layers in 
the network. 

We train the model using the standard stochastic gradient descent 
and momentum optimizer. We use minibatch size of 128, and 
tune the momentum and learning rate, which we parametrize as 
1 — 0.1*^ and 0.1^ respectively. We anneal the learning rate by 
a factor of 0.1 at epochs 130 and 190. We terminate the training 
after 200 epochs. 

We regularize the weights of all layers with weight decay coef¬ 


ficient W. We apply dropout on the outputs of the max pooling 
layers, and tune these rates D\ , D 2 separately. 

A. 4. Testing procedure 

We evaluate the performance of the learned model by averag¬ 
ing its log-probability predictions on 100 samples drawn from 
the input corruption distribution, with masks drawn from the unit 
dropout distribution. 

B. Additional figures for image caption 
generation 

In Eigures 5(a) and 5(b) we provide some additional visualization 
of the results from the image caption generation experiments from 
Section 4.2 to highlight the behavior of the Bayesian optimization 
routine. Both figures show the validation BLEU-4 Score on MS 
COCO corresponding to different hyperparameter configurations 
as evaluated over iterations of the optimization procedure. In Eig- 
ure 5(a), these are represented as a planar histogram, where the 
shade of each bin indicates the total count within it. The current 
best validation score discovered is traced in black. Eigure 5(b) 
shows a scatter plot of the validation score of all the experiments 
in the order in which they finished. Validation scores of 0 cor¬ 
respond to constraint violations. These figures demonstrate the 
exploration-versus-exploitation paradigm of Bayesian Optimiza¬ 
tion, in which the algorithm trades off visiting unexplored parts 
of the space, and focusing on parts which show promise. 

C. Multimodal neural language model 
hyperparameters 

C.l. Description of the hyperparameters 

We optimize a total of 11 hyperparameters of the log-bilinear 
model (LBL). Below we explain what these hyperparameters refer 
to. 

Model The LBL model has two variants, an additive model 
where the image features are incorporated via an additive bias 
term, and a multiplicative that uses a factored weight tensor to 
control the interaction between modalities. 

Context size The goal of the LBL is to predict the next word 
given a sequence of words. The context size dictates the number 
of words in this sequence. 

Learning rate, momentum, hatch size These are opti¬ 
mization parameters used during stochastic gradient learning of 
the LBL model parameters. The optimization over learning rate 
is carried out in log-space, but the proposed learning rate is expo¬ 
nentiated before being passed to the training procedure. 

Hidden layer size This controls the size of the joint hidden 
representation for words and images. 

Embedding size Words are represented by feature embed¬ 
dings rather than one-hot vectors. This is the dimensionality of 
the embedding. 

Dropout A regularization parameter that determines the 
amount of dropout to be added to the hidden layer. 

























Scalable Bayesian Optimization Using Deep Neural Networks 




(a) Heatmap 


(b) Scatter Plot 


Figure 5. Validation BLEU-4 Score on MS COCO corresponding to different hyperparameter configurations as evaluated over time. In 
Figure 5(a), these are represented as a planar histogram, where the shade of each bin indicates the total count within it. The current best 
validation score discovered is traced in black. Figure 5(b) shows a scatter plot of the validation score of all the experiments in the order 
in which they finished. This projection demonstrates the exploration-versus-exploitation paradigm of Bayesian Optimization, in which 
the algorithm trades off visiting unexplored parts of the space, and focusing on parts which show promise. 


Context decay, Word decay £2 regularization on the input 
and output weights respectively. Like the learning rate, these are 
optimized in log-space as they vary over several orders of magni¬ 
tude. 

Factors The rank of the weight tensor. Only relevant for the 
multiplicative model. 
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Hyperparameter 

Notation 

Support of prior 

CIFAR-10 Optimum 

CIFAR-100 Optimum 

Momentum 

M 

[0.5,2] 

1.6242 

1.3339 

Learning rate 

L 

[1,4] 

2.7773 

2.1205 

Initialization deviation 

cri 

[0.5,1.5] 

0.83359 

1.5570 

Input initialization deviation 

a 

[0.01,1] 

0.025370 

0.13556 

Hue shift 

Bh 

[0,45] 

31.992 

19.282 

Saturation scale 

As 

[0,0.5] 

0.31640 

0.30780 

Saturation shift 

Bs 

[0,0.5] 

0.10546 

0.14695 

Value scale 

As 

[0,0.5] 

0.13671 

0.13668 

Value shift 

Bs 

[0,0.5] 

0.24140 

0.010960 

Pixel dropout 

Do 

[0,0.3] 

0.19921 

0.00056598 

Scaling 

S 

[0,0.3] 

0.24140 

0.12463 

L2 weight decay 

W 

[2,5] 

4.2734 

3.1133 

Dropout 1 

Di 

[0,0.7] 

0.082031 

0.081494 

Dropout 2 

D2 

[0,0.7] 

0.67265 

0.38364 


Table 6. Specification of the hyperparametrization scheme, and optimal hyperparameter configurations found. 


Hyperparameter 

Support of prior 

Notes 

COCO Optimum 

Model 

{additive,multiplicative} 


additive 

Context size 

[3,25] 


5 

Learning rate 

[0.001,10] 

Log-space 

0.43193 

Momentum 

[0,0.9] 


0.23269 

Batch size 

[20,200] 


40 

Hidden layer size 

[100,2000] 


441 

Embedding size 

{50,100,200} 


100 

Dropout 

[0,0.7] 


0.14847 

Word decay 

[10-^10-3] 

Log-space 

2.98456-'^ 

Context decay 

[10-^10-3] 

Log-space 

1.09181"® 

Factors 

[50,200] 

Multiplicative model only 

- 


Table 7. Specification of the hyperparametrization scheme, and optimal hyperparameter configurations found for the multimodal neural 
language model. For parameters marked log-space, the log is given to the Bayesian optimization routine and the result is exponentiated 
before being passed into the multimodal neural language model for training. Square brackets denote a range of parameters, while curly 
braces denote a set of options. 




